Assembly Language
© Copyright Brian Brown, 1988-2000. All rights reserved.


16-32 BIT MICROPROCESSORS

This module is the individual work of Brian Brown. It may not be copied or used in any form without his permission.


OBJECTIVE
The study of advanced micro-processor architectures will aid the student in understanding complex systems and enable efficient software production.


INTRODUCTION: 32-bit micros (68020/30/40, iAPX286/386/486)
32-bit micro-processors share a number of common characteristics, the most important of which (instruction pre-fetching, pipelining, caching and wide data busses) are discussed below.

Instruction pre-fetching is a technique which fills the processor's internal instruction queue whilst the processor is busy decoding/executing the current instruction. The idea is to transfer program instructions from system memory into high-speed processor storage, so the processor can run as fast as possible without wait states.

The trend in modern processors is to separate the decode/execution logic from the bus interface unit (which controls access to the system busses). This allows the BIU to fetch the next instruction whilst the DEU is handling the current instruction. An instruction queue links the two separated units together. The BIU tries to keep the instruction queue fully loaded, whilst the DEU tries to get its next instruction from the instruction queue. Immediate-type instructions execute faster with this approach. Other memory-access instructions (like direct addressing) require the DEU to ask the BIU to perform the operand fetch/write on its behalf.
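The interplay between the two units can be modelled in C (a software sketch only; the names here are invented for illustration, and real prefetch logic is of course implemented in hardware):

#include <stdio.h>

#define QUEUE_SIZE 4                      /* small prefetch queue */

static unsigned char memory[16] = { 0x01, 0x02, 0x03, 0x04,
                                    0x05, 0x06, 0x07, 0x08 };
static unsigned char queue[QUEUE_SIZE];   /* instruction queue */
static int head = 0, count = 0, fetch_addr = 0;

/* BIU: each clock, fetch one byte into the queue if there is room */
static void biu_cycle(void) {
    if (count < QUEUE_SIZE && fetch_addr < 16) {
        queue[(head + count) % QUEUE_SIZE] = memory[fetch_addr++];
        count++;
    }
}

/* DEU: each clock, take the next opcode from the queue if one is ready */
static int deu_cycle(void) {
    if (count == 0)
        return -1;                        /* queue empty: DEU stalls */
    int op = queue[head];
    head = (head + 1) % QUEUE_SIZE;
    count--;
    return op;
}

int main(void) {
    for (int clk = 0; clk < 10; clk++) {  /* both units work every clock */
        biu_cycle();
        printf("clk %d: executed %d\n", clk, deu_cycle());
    }
    return 0;
}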

CPU Performance can be increased by

  1. larger data busses
  2. higher clock speeds
  3. fewer clock cycles per data transfer
  4. a change away from the von Neumann architecture (eg, parallel, Harvard, RISC etc)

In general, options 1) and 3) have limits which are quickly reached. The easy option is to increase the clock speed, giving shorter cycle times. However, this requires faster RAM for the processor, which is costly. Wait states are used to interface DRAM to processors where the DRAM cannot respond at the speed of the processor. The CPU holds the address and control lines for extra clock cycles to give the DRAM sufficient time to respond.
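As a rough worked example (the figures are illustrative only): a 25MHz processor has a 40ns clock period, and a 386-style bus cycle takes two clocks (80ns). If the DRAM needs 100ns to respond,

clocks required   = 100ns / 40ns = 2.5, rounded up to 3
wait states added = 3 - 2 = 1

so every memory access costs one extra clock, and the system runs noticeably slower than the clock speed alone suggests.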

To build a system with zero wait-state memory requires either RAM fast enough to keep pace with the processor, or techniques such as caching, interleaving and pulsed CAS access, all of which are described below.

CPU pipelining divides the DEU logic into several stages. These separate stages allow the processor to decode/execute several instructions at the same time. Immediate operands have already been fetched by an earlier stage, so such an instruction can be executed without any reference to external memory or the internal instruction queue. In fact, several register-type instructions can be in different DEU stages at the same time, with the result that certain instructions have an effective execution time of zero clocks.
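The gain from pipelining can be quantified: with k stages and one clock per stage, n instructions take k + (n - 1) clocks instead of n x k. For example, 100 instructions through a 4-stage pipeline take 4 + 99 = 103 clocks rather than 400, approaching one instruction per clock once the pipeline is full.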

Cache memory is high-speed memory (on- or off-chip) which sits between the CPU and system DRAM. The CPU can access the instructions/data stored in cache at a faster rate than those stored in system memory. A cache controller is used to pre-fetch instructions in order to keep the cache relatively full. This allows the processor to run at full speed (zero wait states).

Most cache systems achieve a hit rate of about 98%, which means that 98% of all CPU requests for instructions/data are found in cache memory. The size of the cache is often the determining factor. The cache cannot be made too large, else it will overload the system bus, cost too much, and might never be filled.

A CACHE HIT means the processor has found the required data/instruction inside the cache. A CACHE MISS is the opposite.
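The effect of the hit rate on the average access time can be estimated (illustrative figures, assuming 20ns cache and 100ns DRAM):

t(average) = hit rate x t(cache) + miss rate x t(DRAM)
           = 0.98 x 20ns + 0.02 x 100ns
           = 19.6ns + 2.0ns
           = 21.6ns

which is close to the raw cache speed, despite the slow DRAM.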

The cache works by using an Address Tag Comparator (ATC). On the first memory read by the CPU, the ATC loads itself with the address output by the processor. Whilst the processor is decoding/executing that instruction, the cache hardware controller accesses subsequent memory locations, gradually filling up the cache table. When the CPU issues the next address, the ATC checks whether it falls within the range of the cache table. If not, the CPU is connected directly to RAM, then the ATC updates itself and clears the cache table entries.
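A minimal C sketch of this range-based scheme follows (a simplification for illustration; real caches compare stored tag fields line by line rather than a single address range, and all names here are invented):

#include <stdbool.h>

#define CACHE_SIZE 64            /* cache holds one contiguous block */

static unsigned tag_base;        /* ATC: address of first cached byte */
static unsigned valid_count;     /* how much of the table is filled */

/* Called by the controller whilst the CPU is busy decoding, filling
   the table with subsequent memory locations */
static void cache_prefetch(void) {
    if (valid_count < CACHE_SIZE)
        valid_count++;
}

/* Returns true on a CACHE HIT. On a MISS the CPU goes directly to
   RAM, the ATC reloads itself and the table entries are cleared. */
static bool cache_access(unsigned addr) {
    if (addr >= tag_base && addr < tag_base + valid_count)
        return true;
    tag_base = addr;
    valid_count = 0;
    return false;
}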

Cache memory increases performance, but several factors affect how well it works,

  1. the size of the cache
  2. the structure of the program (jump statements continually flush the cache)
  3. the method used to handle writes to cached locations

Programs with jump statements impede cache use by continually flushing the cache entries. Some processors support both instruction and data caches; this prevents the instruction cache being flushed by a memory write. When using a data cache, some method of handling writes to locations which also appear in cache must be implemented. The choices are write through, or flush on delay.

The write through method writes both to the data cache entry and to system memory simultaneously. This prevents cache and system data getting out of step. The flush on delay method (also known as write back) gives faster performance, as data written to cache might be needed again in the immediate future (loop count variables etc). However, it requires more complex hardware to implement.

Cache memory is commonly used in conjunction with system DRAM that supports a pulsed CAS access scheme.

Nibble mode access is a way of accessing four sequential bits in a 1-bit wide DRAM array. On the first access, the RAS and CAS lines are taken low, and the data is read/written in the normal cycle time (say 80ns). If RAS is held low and CAS pulsed, the next three locations can be accessed without going through the normal setup times. This means the next three locations can be accessed at a quicker rate (say 35ns each). The cache controller provides the necessary timing signals for accessing the system DRAM using pulsed CAS nibble mode.
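Using the figures above, four sequential accesses cost,

normal mode : 4 x 80ns        = 320ns
nibble mode : 80ns + 3 x 35ns = 185ns

so nibble mode reads the same four locations in a little over half the time.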


Memory Interleaving
Memory interleaving is a memory access technique which divides the system memory into a series of equal-sized banks. These banks are expressed in terms of an interleave factor (eg, 2x, 4x etc).

[Figure: interleaved memory diagram]

Data is read 32 bits at a time (so address lines A0 and A1 are not needed). Upon a read, the first word is read using one wait state, whilst the second is read with zero wait states, using external address pipelining by the interleave controller. The above diagram shows a two-times interleave.
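For a two-times interleave with 32-bit wide banks, the bank is selected by address bit A2, so sequential word addresses alternate between the two banks (a sketch; the function names are invented):

/* A0/A1 select the byte within a 32-bit word, A2 selects the bank,
   and the remaining bits index the word within the chosen bank */
unsigned bank_select(unsigned addr) { return (addr >> 2) & 1; }
unsigned bank_offset(unsigned addr) { return addr >> 3; }

It is this alternation which lets the interleave controller present the next address to one bank whilst the other is still completing its access.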

The advantage of interleaved memory (for a system utilizing single wait-state memory) is that only the first access in a sequence pays the wait-state penalty; subsequent accesses which fall in the other bank complete with zero wait states, substantially reducing the average access time.

In essence, it's a bit like pulsed CAS memory, in that the first access runs at normal speed, but subsequent accesses in other banks run at a faster rate. The requirements for memory interleaving are equal-sized memory banks, a latched address bus, and an interleave controller capable of address pipelining.

In systems using DMA, interleaving has the further advantage of multiple bank access, in that the CPU can be accessing one bank whilst a DMA device accesses another.

Using a latched address system, the address bus can be changed once the address has been latched/captured by the system memory. The next address can thus be set up whilst the memory is still using the first.

Memory interleaving is directly supported by the iAPX386 processor. It performs address pipelining, which means placing the address of the next bus cycle on the bus before the current bus cycle has finished. This is possible in latched systems because there are at least a couple of clock cycles between the time the address is latched and the time the data becomes valid. It is during this period that the processor can be instructed to present the address of the next bus cycle on its address pins. It does this when the input pin NA is asserted by the interleave controller.


BANKED MEMORY
The early eight-bit computer systems with their limited address range (64K) quickly ran out of memory. Programmers required greater amounts of RAM, but the processor could not address it. Several techniques evolved to increase the physical RAM in a computer system beyond what the processor could directly address; the most widespread of these was banked memory.

Banked memory overcame the problem by assigning multiple banks of memory (all the same size) to a single address range. These banks could be switched in and out of the main address space of the processor by writing to a port or memory address. A bank could be switched in, data copied to it, then switched out again.

[Figure: banked memory]

Typical circuitry was,

[Figure: banked memory circuitry]

Only one output is active low for a specified D0-D2 input combination.

In the IBM-PC/XT, this arrangement is called EMS (Expanded Memory Specification), and uses a bank size of 64K managed as a series of 16K pages. A special software driver (EMM.SYS) is used to manipulate the memory banks, which are normally mapped into the region A0000 to F0000. This is because the iAPX86 has a limit of 1MB of addressable memory. Placing commonly used utilities in EMS frees up more of the 640K system RAM for DOS. Note that software packages such as Lotus 1-2-3 support EMS memory.
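A sketch of how software might drive such bank-switching hardware, assuming a hypothetical bank-select port at I/O address 0xE0 and a 16K window at a fixed memory address (all specifics invented for illustration):

#include <string.h>

#define BANK_PORT  0xE0                      /* hypothetical select port */
#define BANK_SIZE  0x4000                    /* 16K window */
static unsigned char *const window = (unsigned char *)0x8000;

extern void port_write(unsigned short port, unsigned char value);

/* Switch bank n into the window, copy data to it, switch it out */
void copy_to_bank(int n, const void *src, unsigned len) {
    port_write(BANK_PORT, (unsigned char)n); /* bank n now visible */
    memcpy(window, src, len < BANK_SIZE ? len : BANK_SIZE);
    port_write(BANK_PORT, 0);                /* restore default bank */
}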


EXTENDED MEMORY (AT/386)
Extended memory is placed above the 1MB boundary of the iAPX86 processor. ATs and 386s can address more memory (AT = 16MB, 386 = 4GB). This memory is not accessible to DOS (which is limited to 640K), but can be configured for use as a RAMDRIVE (using VDISK.SYS or RAMDRIVE.SYS).

This memory is accessible only in PROTECTED MODE, but note that some software contains routines written for protected mode to enlarge the DOS workspace (a good example being AUTOCAD).


DYNAMIC BUS SIZING
32-bit processors can automatically reconfigure themselves to suit the data bus size of different peripheral devices on a cycle-by-cycle basis. This is called Dynamic Bus Sizing. It means that to load a 32-bit register from 8-bit memory, the processor automatically runs four consecutive bus cycles to obtain the 32 bits of required data. Input pins to the processor are used to change the size of the data transfer on a cycle-by-cycle basis.
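The effect is as if the processor did the following (a C model; bus_read8 stands in for one 8-bit bus cycle):

#include <stdint.h>

extern uint8_t bus_read8(uint32_t addr);   /* one 8-bit bus cycle */

/* A 32-bit load from an 8-bit device: four consecutive bus cycles,
   assembled little-endian as on Intel processors */
uint32_t load32_from_8bit(uint32_t addr) {
    uint32_t value = 0;
    for (int i = 0; i < 4; i++)
        value |= (uint32_t)bus_read8(addr + i) << (8 * i);
    return value;
}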


APPROACHES TO PERFORMING I/O

I/O MAPPING (Intel series)
Rather than cluttering up the main address space, Intel processors provide a separate address space for handling peripheral devices. A special control line, IO/M, selects either the main address map or the port map.

The signal line is used to generate appropriate address decoding for the chip select signal.

[Figure: I/O mapped ports]

Note that only the low 16 bits of the address bus are used, to select one of 65536 ports.

Some advantages are,

  1. peripheral devices do not consume any of the main memory address space
  2. address decoding is simpler, since only the low 16 address lines need be examined
  3. the dedicated IN/OUT instructions make I/O accesses easy to identify

Some disadvantages are,

  1. only the IN/OUT instructions can access ports, so the full range of addressing modes is not available
  2. an extra control signal (IO/M) is needed to distinguish port cycles from memory cycles

The 386 processor provides an I/O permission bitmap per task. This bitmap can protect sensitive peripherals from specific tasks.
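On a modern Linux/x86 system, port-mapped I/O can still be exercised from user space (a sketch; it must run as root, and port 0x80 is chosen only because it is the traditional diagnostic port):

#include <stdio.h>
#include <sys/io.h>                 /* inb/outb and ioperm (glibc, x86) */

int main(void) {
    if (ioperm(0x80, 1, 1) != 0) {  /* request access to port 0x80 */
        perror("ioperm");
        return 1;
    }
    outb(0x55, 0x80);               /* write a byte to the port */
    printf("port 0x80 reads 0x%02x\n", inb(0x80));
    return 0;
}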


MEMORY MAPPING (Motorola)
The peripheral devices are located within the main address space of the processor.

[Figure: memory mapped I/O]

Some advantages are

  1. the full instruction set and all addressing modes can be used on peripheral registers
  2. no special I/O instructions or extra control lines are required

Some disadvantages are

  1. peripheral devices consume part of the main memory address space
  2. address decoding is more complex, since the full address bus must be examined


IO CHANNEL COPROCESSOR (IBM)
Concurrent operation of the CPU and I/O devices requires the use of a special I/O processor. The main CPU instructs the I/O processor to perform the required data transfer. When the transfer is completed, the I/O processor informs the main processor of the status of the operation.

This method frees the main processor to perform other tasks whilst I/O is being done (tasks requesting I/O are blocked by the OS and thus not scheduled for processor time).

Typical features of an I/O channel processor system are

  1. the channel executes its own channel program, held in main memory
  2. transfers proceed independently of, and concurrently with, the main CPU
  3. each channel supports a number of devices via sub-channels
  4. the CPU is informed of the status of the operation on completion

There are two main types of IO channels, the selector channel and the multiplexor channel.

Both channels support a number of devices on a bus called a sub-channel.

The selector channel operates in burst mode only. It handles a single sub-channel at a time, and has very high transfer rates. Typically, it controls high speed disk units.

The multiplexor channel handles more than one sub-channel at a time by interleaving requests. It operates in byte and word mode, and supports burst mode, but at a much lower rate than a selector channel. Typically, it handles devices like printers and character terminals.

Channel Operation
The processor initiates an I/O transfer by setting up a special IOC program in main memory. It then issues a STARTIO instruction, which identifies the channel and sub-channel.

The channel then accesses and runs the channel program (the address of which is held in location 72). When finished, the channel updates the I/O flag in the processor's status register to signal command completion. The processor then checks the channel status register for the results.

Each channel gets informed of

  1. the sub-channel (device) involved in the transfer
  2. the address of the channel program it is to execute

The channel is a sophisticated DMA controller.
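The flavour of a channel program can be sketched in C (loosely modelled on the IBM System/360 channel command word; the field sizes and names here are illustrative, not an exact layout):

#include <stdint.h>

/* One channel command word: the channel fetches and executes a
   chain of these independently of the main CPU */
struct ccw {
    uint8_t  command;        /* read, write, control, sense, ... */
    uint32_t data_address;   /* main-memory buffer (24 bits on S/360) */
    uint8_t  flags;          /* eg, chain on to the next CCW */
    uint16_t count;          /* number of bytes to transfer */
};

The CPU builds a chain of these in main memory, stores the program's address at the agreed location, issues STARTIO, and carries on with other work.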

[Figure: I/O channel co-processor]


Processor Architectures

Accumulator
The processor consists of one or more accumulators which are used for data storage. The majority of instructions deal with transferring data between memory and the accumulators. Data is manipulated in the accumulators. An example of this architecture is the MC6802.

Instructions are executed according to the fetch, decode, execute cycle. Instructions are generally single-byte opcodes with 0, 1 or 2 operands.


General Register
The processor consists of a large bank of registers, which can be used in most of the available addressing modes (as index or data registers). All registers are the same size, and the majority of instructions deal with the manipulation of registers.

The instruction size is expanded, with opcodes 1, 2 or 3 bytes long. This reflects the necessity of encoding register and effective address fields into the opcode of the instruction.

The MC68000 family is an example of this architecture, using 3-bit fields for encoding the source and destination register fields.


Reduced Instruction Set Computers
Early computer systems were implemented using an ALU and a control system. The control system was composed of discrete logic devices. The interconnections between these logic devices proved unreliable, and difficult to design and debug.

In the early 1950s, the British pioneer Maurice Wilkes came up with a system which implemented the control system via a block of fixed memory. Each column in the memory represented a control line, and the 0's and 1's in each row set a specific state in the ALU. This simplified the design of control systems for processors, and became known as micro-programming.

Several trends accompanied the development of the CISC (complex instruction set computer). These were

  1. increasingly large instruction sets
  2. more complex and powerful addressing modes
  3. greater reliance on micro-programmed control units

Compiler writers wanted complex instructions so that the task of translation was easier and more efficient.

However, in the mid 1970s several researchers began to have doubts about CISC architectures. They believed that the complex instruction set actually reduced the real performance of the processor. It was discovered that the complex instructions were seldom executed, and that simple instructions predominated.

Complex instructions require complex decoding circuits, which leads to costly design and increased silicon space, and tends to slow the processor down (it limits the clock speed, and complex instructions take many clock cycles to execute).

Designers set about rethinking what a processor should do. They came up with the following criteria,

  1. instructions should be simple, of fixed format, and execute in a single clock cycle
  2. the control unit should be hardwired, not micro-programmed
  3. memory should be accessed only by load and store instructions, with all other operations working register to register
  4. a large set of general purpose registers should be provided

It was argued that advances made in design had solved the problems of hardwired control systems (which means higher clock rates); that compilers would generate more efficient code if the instruction set was simple and consistent; and that since instructions execute in a single clock cycle, such a processor should outperform CISC processors.

Programs would be longer because of the simpler instruction set, but the speed of the processor would make up for it, so the overall net result would be a lower execution time.

Processors developed to adhere to these criteria are called RISC. The advantages are,

  1. simple hardwired control permits higher clock rates
  2. simpler decode logic frees silicon space for registers and cache
  3. most instructions complete in a single clock cycle
  4. a small, consistent instruction set is an easier target for optimizing compilers

Examples of RISC processors are the MC88100 and the IBM RS6000.


Stack Architectures
The processor has a dedicated stack pointer and all operations are done on the top of the stack.

In a stack machine, there is a sequence of registers that are used in a special way. Imagine that these registers are called A[1], A[2] to A[n].

At the beginning of execution, there are no particular values associated with any of the registers, and a special stack pointer register STP contains the value zero. A load operation increments STP and copies the contents of a memory location into A[STP]:

1: if STP = n, signal a stack overflow

2: else STP = STP + 1

3: A[STP] = data

Conversely, a store operation first checks to see if the registers contain any useful information (if STP > 0). If not, this indicates a stack underflow; otherwise A[STP] is copied to memory and STP is decreased by 1.

Arithmetic is done by taking the contents of the last two occupied registers (A[STP] and A[STP-1]), combining them as specified by the instruction, and placing the result into A[STP-1]. Since two values have been removed, STP is decreased by 1 to point to the result.
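A small C model of such a machine follows, implementing the rules above (the opcode names match the example program below; everything else is invented for illustration):

#include <stdio.h>

#define N 16                        /* number of stack registers */
static int A[N + 1];                /* A[1]..A[N]; A[0] unused */
static int STP = 0;                 /* stack pointer, 0 = empty */
static int mem[256];                /* main memory */

static void load(int addr) {        /* push mem[addr] onto the stack */
    if (STP == N) { printf("stack overflow\n"); return; }
    A[++STP] = mem[addr];
}

static void store(int addr) {       /* pop the top of stack to memory */
    if (STP == 0) { printf("stack underflow\n"); return; }
    mem[addr] = A[STP--];
}

static void mul(void) {             /* combine the top two entries */
    A[STP - 1] *= A[STP]; STP--;
}

static void add(void) {
    A[STP - 1] += A[STP]; STP--;
}

int main(void) {                    /* E = A * B + C * D, as below */
    mem[50] = 3; mem[51] = 4; mem[52] = 5; mem[53] = 2;
    load(50); load(51); mul();
    load(52); load(53); mul();
    add();
    printf("result = %d (STP = %d)\n", A[STP], STP);  /* 22, STP = 1 */
    return 0;
}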

Example operation
Consider the calculation of the formula

E = A * B + C * D

The values A and B are loaded onto the stack and their product is left at the top. Assuming that the values of A, B, C and D are 3, 4, 5 and 2 respectively (and stored in locations 50-53 respectively), the program looks like,

LOAD 50
LOAD 51
MUL
LOAD 52
LOAD 53
MUL
ADD

The first instruction LOAD 50 leaves the stack like

A[1] = 3
STP = 1

The instruction LOAD 51 leaves the stack like

A[1] = 3
A[2] = 4
STP = 2

The instruction MUL leaves the stack like

A[1] = 12
STP = 1

The instruction LOAD 52 leaves the stack like

A[1] = 12
A[2] = 5
STP = 2

The instruction LOAD 53 leaves the stack like

A[1] = 12
A[2] = 5
A[3] = 2
STP = 3

The instruction MUL leaves the stack like

A[1] = 12
A[2] = 10
STP = 2

The instruction ADD leaves the stack like

A[1] = 22
STP = 1

An example of this architecture is an arithmetic co-processor like the 80387.


Intel Micro-Processors


1971	4004	4-bit	DB=4/AB=12	nibble wide, 4K program memory
1972	8008	8-bit	DB=8/AB=14	byte wide, 16K bytes
1974	8080	8-bit	DB=8/AB=16	64K bytes
1978	8086	16-bit	DB=16/AB=20	1M byte, segmentation (16 x 64K), 4.77MHz
1982	80186	16-bit	DB=16/AB=20	built-in PIC and bus controller
1982	80286	16-bit	DB=16/AB=24	16M bytes, introduced protected mode, 8-12MHz
1985	80386	32-bit	DB=32/AB=32	4G bytes, paging, 16-33MHz
1989	80486	32-bit	DB=32/AB=32	as the 386 with the 387 FPU on-chip, 25-50MHz
